Conversation

@lucaslie
Member

@lucaslie lucaslie commented Oct 30, 2025

Description

  • Configurable cache config via YAML or CLI args
  • Utility to merge the cache config from the user and the factory (user takes precedence); a sketch is shown below
  • Fix in the Triton Mamba kernel to ensure the SSM state dtype is respected
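
A minimal sketch of the user-over-factory merge behavior (the function name and dict shapes are illustrative, not the PR's actual helper):

from typing import Any, Dict

def merge_cache_config(factory: Dict[str, Any], user: Dict[str, Any]) -> Dict[str, Any]:
    """Recursively merge cache-config dicts; user-provided values override factory defaults."""
    merged = dict(factory)
    for key, user_val in user.items():
        if isinstance(merged.get(key), dict) and isinstance(user_val, dict):
            merged[key] = merge_cache_config(merged[key], user_val)
        else:
            merged[key] = user_val
    return merged

# The user's mamba_dtype overrides the factory default; untouched keys are kept.
factory_cfg = {"cache_config": {"dtype": "bfloat16", "mamba_dtype": None}}
user_cfg = {"cache_config": {"mamba_dtype": "float32"}}
merged = merge_cache_config(factory_cfg, user_cfg)
assert merged["cache_config"] == {"dtype": "bfloat16", "mamba_dtype": "float32"}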

The Mamba cache can be configured via:

args:
  transforms:
    insert_cached_ssm_attention:
      cache_config:
        mamba_dtype: float32

or

--args.transforms.insert_cached_ssm_attention.cache_config.mamba_dtype=float32

For extra_llm_args in trtllm-bench or trtllm-serve, remove the args prefix.
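
For example, an extra-options YAML file passed to trtllm-bench or trtllm-serve (the filename is illustrative) could look like:

# extra_options.yaml (hypothetical filename)
transforms:
  insert_cached_ssm_attention:
    cache_config:
      mamba_dtype: float32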

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
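
For instance, a hypothetical invocation that combines several of the options above (the stage name is only an example taken from this help text):

/bot run --disable-fail-fast --extra-stage "H100_PCIe-TensorRT-Post-Merge-1"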

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can cause the top of tree to break.

@lucaslie lucaslie requested a review from a team as a code owner October 30, 2025 20:56
@lucaslie lucaslie requested a review from MrGeva October 30, 2025 20:56
@lucaslie lucaslie self-assigned this Oct 30, 2025
@lucaslie lucaslie moved this from Backlog to In review in AutoDeploy Board Oct 30, 2025
@lucaslie lucaslie linked an issue Oct 30, 2025 that may be closed by this pull request
@coderabbitai
Contributor

coderabbitai bot commented Oct 30, 2025

📝 Walkthrough

Walkthrough

These changes extend the Mamba SSM caching infrastructure by introducing a dedicated dtype configuration option, updating metadata tuple signatures to include additional state information, and refactoring the Triton backend to inherit from the Torch backend implementation to eliminate duplicate interface methods.

Changes

  • Cache Configuration Enhancement (tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py): Added an optional mamba_dtype field to the CacheConfig dataclass to support dtype-specific caching logic for Mamba operations.
  • Torch Backend Mamba Updates (tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_backend_mamba.py): Updated get_prepare_metadata_op to return a 4-tuple that includes a use_initial_states flag. Modified get_cache_initializers to prefer cache_config.mamba_dtype when set, falling back to the dtype derived from the source node metadata for SSM cache construction.
  • Triton Backend Refactoring (tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/triton_backend_mamba.py): Changed TritonBackendSSM inheritance from AttentionDescriptor to TorchBackendSSM. Removed 8 duplicate methods (is_paged, get_attention_layout, get_num_qkv_args, get_source_attention_op, get_prepare_metadata_op, get_cache_initializers, get_global_buffer_initializers, get_constants). Retained only get_cached_attention_op as the public interface and updated imports accordingly.

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant TorchBackendSSM as get_cache_initializers<br/>(Torch Backend)
    participant NodeMeta as source_attn_node<br/>.meta["val"]
    participant CacheConfig
    participant SSMCache as ssm_state_cache

    Caller->>TorchBackendSSM: get_cache_initializers(source_attn_node, cache_config)
    TorchBackendSSM->>NodeMeta: Extract dtype
    NodeMeta-->>TorchBackendSSM: dtype (from node metadata)
    TorchBackendSSM->>CacheConfig: Check cache_config.mamba_dtype
    alt mamba_dtype available
        CacheConfig-->>TorchBackendSSM: mamba_dtype
        TorchBackendSSM->>SSMCache: Create with mamba_dtype
    else mamba_dtype not set
        CacheConfig-->>TorchBackendSSM: None
        TorchBackendSSM->>SSMCache: Use node dtype as fallback
    end
    SSMCache-->>Caller: Cache with resolved dtype
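
A minimal Python sketch of the resolution order shown above (the function and argument names are illustrative, not the actual get_cache_initializers signature):

from typing import Optional

import torch

def resolve_ssm_cache_dtype(
    node_dtype: torch.dtype,
    mamba_dtype: Optional[torch.dtype] = None,
) -> torch.dtype:
    # Prefer an explicitly configured cache dtype; otherwise fall back to the
    # dtype recorded in the source attention node's metadata.
    return mamba_dtype if mamba_dtype is not None else node_dtype

# Example: keep the SSM state cache in fp32 even for a bf16 model.
assert resolve_ssm_cache_dtype(torch.bfloat16, torch.float32) == torch.float32
assert resolve_ssm_cache_dtype(torch.bfloat16) == torch.bfloat16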

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • triton_backend_mamba.py — Requires careful verification that inheriting from TorchBackendSSM correctly replaces all 8 removed methods and that the public API contract is maintained. Ensure no unintended behavioral changes from inheritance-based method resolution.
  • torch_backend_mamba.py — Verify dtype resolution logic correctly prioritizes cache_config.mamba_dtype over node dtype, and confirm the expanded 4-tuple return from get_prepare_metadata_op is handled correctly by all callers.
  • attention_interface.py — Confirm backward compatibility; the optional field should not affect existing code paths that don't use it.

Pre-merge checks

❌ Failed checks (2 warnings)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
  • Description check (⚠️ Warning): The PR description lacks proper structure; concrete implementation details are missing under the Description and Test Coverage sections. Resolution: fill in the Description section with a clear explanation of the issue and solution, and the Test Coverage section with relevant test cases that validate the changes.
✅ Passed checks (1 passed)
  • Title check (✅ Passed): The title '[#8763][feature] AutoDeploy: configurable dtype for caching' directly matches the main code changes, which introduce an optional mamba_dtype field for configurable cache dtype handling across the attention and mamba backend implementations.


@suyoggupta
Collaborator

what's the status of this PR? @lucaslie

@lucaslie lucaslie force-pushed the ll/fp32_mamba_cache branch from 518086c to 45f971e on November 10, 2025 19:14
@lucaslie lucaslie changed the title [#8763][fix] AutoDeploy: correct mamba cache dtype extraction [#8763][feature] AutoDeploy: configurable dtype for caching Nov 10, 2025
@lucaslie
Member Author

/bot run

@lucaslie lucaslie requested a review from 2ez4bz November 10, 2025 19:20
@tensorrt-cicd
Collaborator

PR_Github #24045 [ run ] triggered by Bot. Commit: 45f971e

@tensorrt-cicd
Collaborator

PR_Github #24045 [ run ] completed with state SUCCESS. Commit: 45f971e
/LLM/main/L0_MergeRequest_PR pipeline #18119 completed with status: 'FAILURE'

Member Author

@lucaslie lucaslie left a comment


There is a perf regression that I am still looking into

edit: re-ran the benchmark and looks good, see results below: #8812 (comment)

@lucaslie
Member Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #24062 [ run ] triggered by Bot. Commit: 45f971e

@lucaslie
Member Author

lucaslie commented Nov 11, 2025

Bf16 pre:

===========================================================
= PERFORMANCE OVERVIEW 
===========================================================
Request Throughput (req/sec):                     2.8041
Total Output Throughput (tokens/sec):             2871.3826
Total Token Throughput (tokens/sec):              5742.7653
Total Latency (ms):                               91295.3912
Average request latency (ms):                     59847.7003
Per User Output Throughput [w/ ctx] (tps/user):   19.4070
Per GPU Output Throughput (tps/gpu):              2871.3826

Bf16 post:

===========================================================
= PERFORMANCE OVERVIEW 
===========================================================
Request Throughput (req/sec):                     2.8026
Total Output Throughput (tokens/sec):             2869.8821
Total Token Throughput (tokens/sec):              5739.7642
Total Latency (ms):                               91343.1245
Average request latency (ms):                     59720.5701
Per User Output Throughput [w/ ctx] (tps/user):   19.4566
Per GPU Output Throughput (tps/gpu):              2869.8821

Bf16 post with fp32 cache

OOM on TP=1 with consistent settings

FP8 Pre:

===========================================================
= PERFORMANCE OVERVIEW 
===========================================================
Request Throughput (req/sec):                     5.0824
Total Output Throughput (tokens/sec):             5204.3830
Total Token Throughput (tokens/sec):              10408.7660
Total Latency (ms):                               50369.8519
Average request latency (ms):                     49102.8390
Per User Output Throughput [w/ ctx] (tps/user):   20.8603
Per GPU Output Throughput (tps/gpu):              5204.3830

FP8 Post:

===========================================================
= PERFORMANCE OVERVIEW 
===========================================================
Request Throughput (req/sec):                     5.0757
Total Output Throughput (tokens/sec):             5197.5544
Total Token Throughput (tokens/sec):              10395.1089
Total Latency (ms):                               50436.0278
Average request latency (ms):                     49248.4491
Per User Output Throughput [w/ ctx] (tps/user):   20.7990
Per GPU Output Throughput (tps/gpu):              5197.5544

FP8 Post with fp32 cache

===========================================================
= PERFORMANCE OVERVIEW 
===========================================================
Request Throughput (req/sec):                     4.9643
Total Output Throughput (tokens/sec):             5083.4690
Total Token Throughput (tokens/sec):              10166.9380
Total Latency (ms):                               51567.9353
Average request latency (ms):                     50402.6484
Per User Output Throughput [w/ ctx] (tps/user):   20.3229
Per GPU Output Throughput (tps/gpu):              5083.4690

@suyoggupta
Collaborator

Can we check in a nano_v3.yaml to the repo so that we have a version-controlled source of truth for the model config?

@tensorrt-cicd
Collaborator

PR_Github #24062 [ run ] completed with state SUCCESS. Commit: 45f971e
/LLM/main/L0_MergeRequest_PR pipeline #18134 completed with status: 'FAILURE'

@lucaslie lucaslie requested a review from a team as a code owner November 11, 2025 02:04
Signed-off-by: Lucas Liebenwein <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>
@lucaslie lucaslie force-pushed the ll/fp32_mamba_cache branch from d8e8339 to 10cb1b3 on November 11, 2025 02:07
@lucaslie
Member Author

/bot run

@suyoggupta
Collaborator

can you please also post perf with mamba cache set to fp32?

@lucaslie
Member Author

lucaslie commented Nov 11, 2025

can you please also post perf with mamba cache set to fp32?

updated the perf comment with TP=1, fp32 cache, fp8 checkpoint. Same settings run OOM for bf16 with fp32 cache. Do you want any other perf measurements?

@tensorrt-cicd
Collaborator

PR_Github #24078 [ run ] triggered by Bot. Commit: 10cb1b3

@suyoggupta
Collaborator

thanks for adding this. Won't ask for more :)

@tensorrt-cicd
Collaborator

PR_Github #24078 [ run ] completed with state SUCCESS. Commit: 10cb1b3
/LLM/main/L0_MergeRequest_PR pipeline #18145 completed with status: 'SUCCESS'

@lucaslie lucaslie merged commit 6bf4e59 into NVIDIA:main Nov 11, 2025
5 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in AutoDeploy Board Nov 11, 2025
@lucaslie lucaslie deleted the ll/fp32_mamba_cache branch November 11, 2025 06:17
@lucaslie lucaslie restored the ll/fp32_mamba_cache branch November 11, 2025 08:23
@lucaslie lucaslie deleted the ll/fp32_mamba_cache branch November 11, 2025 08:29
suyoggupta pushed a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request Nov 12, 2025

Development

Successfully merging this pull request may close these issues.

[Feature]: AutoDeploy: Allow for specifying mamba cache dtype
